Abstract

In this paper, we present a novel algorithm for piecewise linear regression which can learn
continuous as well as discontinuous piecewise linear functions. The main idea is to
repeatedly partition the data and learn a linear model in each partition. While a simple
algorithm incorporating this idea does not work well, an interesting modification results in
a good algorithm. The proposed algorithm is similar in spirit to the K-means clustering
algorithm. We show that our algorithm can also be viewed as an EM algorithm for maximum
likelihood estimation of parameters under a reasonable probability model. We empirically
demonstrate the effectiveness of our approach by comparing its performance with
state-of-the-art regression learning algorithms on some real-world datasets.

Keywords: Regression, Mixture Models. 



1. Introduction 



In a regression problem, given the training dataset containing pairs of multi-dimensional
feature vectors and corresponding real-valued target outputs, the task is to learn a
function that captures the relationship between feature vectors and their corresponding target
outputs.

Least squares regression and support vector regression are well known and generic
approaches for regression learning problems (Bishop, 2006; Hastie et al., 2001; Smola and
Schölkopf, 1998). In the least squares approach, nonlinear regression functions can be learnt by using
user-specified fixed nonlinear mapping of feature vectors from original space to some suit- 
able high dimensional space though this could be computationally expensive. In support 
vector regression (SVR), kernel functions are used for nonlinear problems. Using a non- 
linear kernel function, SVR implicitly transforms the examples to some high dimensional 
space and finds a linear regression function in the high dimensional space. SVR has a large 
margin flavor and has well studied performance guarantees. In general, SVR solution is not 
easily interpretable in the original feature space for nonlinear problems. 

A different approach to learning a nonlinear regression function is to approximate the 
target function by a piecewise linear function. The piecewise linear approach for regression
problems provides better understanding of the behavior of the regression surface in the
original feature space as compared to the kernel-based approach of SVR. In piecewise linear
approaches, the feature space is partitioned into disjoint regions and for every partition a
linear regression function is learnt. The goal here is to simultaneously estimate the optimal
partitions and the linear model for each partition. This problem is hard and is computationally
intractable (Paoletti et al., 2007).



The simplest piecewise linear function is either a convex or a concave piecewise linear 
function which is represented as a maximum or minimum of affine functions. A generic 
piecewise linear regression function can be represented as a sum of these convex/concave
piecewise linear functions (Breiman, 1993; Wang and Sun, 2005; Magnani and Boyd, 2009).
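
As a small illustration of the max-of-affine representation, the sketch below evaluates a convex piecewise linear function on the real line as the maximum of a few affine pieces. The function name `max_affine` and the example coefficients are illustrative assumptions and do not come from the paper.

```python
from typing import List, Tuple


def max_affine(x: float, pieces: List[Tuple[float, float]]) -> float:
    """Evaluate max_k (w_k * x + b_k) for 1-D affine pieces given as (w_k, b_k)."""
    return max(w * x + b for w, b in pieces)


# |x| is the convex piecewise linear function max(x, -x).
abs_pieces = [(1.0, 0.0), (-1.0, 0.0)]
values = [max_affine(x, abs_pieces) for x in (-2.0, 0.0, 3.0)]
```

A concave piecewise linear function is obtained in the same way with `min` in place of `max`.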

In this paper we present a novel method of learning piecewise linear regression functions. 
In contrast to all the existing methods, our approach is capable of learning discontinuous 
functions also. We show, through empirical studies, that this algorithm is attractive in 
comparison to the SVR approach as well as the hinge hyperplanes method which is among 
the best algorithms for learning piecewise linear functions. 

Existing approaches for piecewise linear regression learning can be broadly classified 
into two classes. In the first set of approaches one assumes a specific form for the function 
and estimates the parameters. The form of a regression function can be fixed by fixing the
number of hyperplanes and fixing the way these hyperplanes are combined to approximate
the regression surface. In the second set of approaches, the form of the regression function
is not fixed a priori.

In fixed structure approaches we search over a parameterized family of piecewise linear 
regression functions and the parameters are learnt by solving an optimization problem to, 
typically, minimize the sum of the squared errors. Some examples of such methods are
mixture of experts and hierarchical mixture of experts (Jacobs et al., 1991; Waterhouse,
1997; Jordan and Jacobs, 1994) models.



In the set of approaches where no fixed structure is assumed, the regression tree (Breiman et al.,
1984; Jayadeva and Chandra, 2002) is the most widely used method. A regression tree is
built by binary or multivariate recursive partitioning in a greedy fashion. Regression trees
split the feature space at every node in such a way that fitting a linear regression function 
to each child node will minimize the sum of squared errors. This splitting or partitioning 
is then applied to each of the child nodes. The process continues until the number of data 
points at a node reaches a user-specified minimum size or the error becomes smaller than 
some tolerance limit. In contrast to decision trees where leaf nodes are assigned class la- 
bels, leaf nodes in regression trees are associated with linear regression models. Most of 
the algorithms for learning regression trees are greedy in nature. At any node of the tree, 
once a hyperplane is learnt to split the feature space, it can not be altered by any of its 
child nodes. The greedy nature of the method can result in convergence to a suboptimal 
solution. 

A more refined regression tree approach is the hinging hyperplane method (Breiman, 1993;
Pucar and Sjöberg, 1998), which overcomes several drawbacks of the regression tree approach.
A hinge function is defined as the maximum or minimum of two affine functions (Breiman,
1993). In the hinging hyperplane approach, the regression function is approximated as a
sum of these hinge functions where the number of hinge functions is not fixed a priori. The
algorithm starts with fitting a hinge function on the training data using the hinge finding
algorithm (Breiman, 1993). Then, the residual error is calculated for every example and based
on this a new hinge function may be added to the model (unless we reach the maximum
allowed number of hinges). Every time a new hinge function is added, its parameters are
found by fitting the residual error. This algorithm overcomes the greedy nature of the regression
tree approach by providing a mechanism for re-estimation of the parameters of each of the earlier
hinge functions whenever a new hinge is added. Overall, the hinging hyperplanes algorithm tries
to learn an optimal regression tree, given the training data.

A different greedy approach for piecewise linear regression learning is the bounded error
approach (Amaldi and Mattavelli, 2002; Bemporad et al., 2003, 2005). In bounded error
approaches, for a given bound ($\epsilon > 0$) on the tolerable error, the goal is to learn a piecewise
linear regression function such that for every point in the training set, the absolute difference
between the target value and the predicted value is less than $\epsilon$. This property is called
the bounded error property. Greedy heuristic algorithms (Bemporad et al., 2003, 2005) have
been proposed to find such a piecewise linear function. These algorithms start with finding
a linear regression function which should satisfy the bounded error property for as many
points in the training set as possible. This problem is known as the maximum feasible
subsystem problem (MAX-FS) and is shown to be NP-hard (Amaldi and Mattavelli, 2002).
The MAX-FS problem is then solved repeatedly on the remaining points until all points are
exhausted. So far, there are no theoretical results to support the quality of the solution of
these heuristic approaches.
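
To make the bounded error property concrete, the following minimal sketch checks it for a given set of predictions. The function name `satisfies_bounded_error` and its argument layout are assumptions made for illustration; the cited heuristics are considerably more involved than this check.

```python
from typing import Sequence


def satisfies_bounded_error(targets: Sequence[float],
                            predictions: Sequence[float],
                            eps: float) -> bool:
    """Return True if |y_n - yhat_n| < eps holds for every training point."""
    return all(abs(y - p) < eps for y, p in zip(targets, predictions))
```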

Most of the existing approaches for learning regression functions find a continuous ap- 
proximation for the regression surface even if the actual surface is discontinuous. In this 
paper, we present a piecewise linear regression algorithm which is able to learn both con- 
tinuous as well as discontinuous functions. 

We start with a simple algorithm that is similar, in spirit, to the K-means clustering
algorithm. The idea is to repeatedly keep partitioning the training data and learning a
hyperplane for each partition. In each such iteration, after learning the hyperplanes, we
repartition the feature vectors so that all feature vectors in a partition have least
prediction error with the hyperplane of that partition. We call it the K-plane regression algorithm.
Though we are not aware of any literature where such a method is explicitly proposed and
investigated for learning regression functions, similar ideas have been proposed in related
contexts. For example, a similar problem is addressed in the system identification literature
(Amaldi and Mattavelli, 2002). A probabilistic version of such an idea was discussed under
the title mixtures of multiple linear regression (Bishop, 2006, Chapter 14).

This K-plane regression algorithm is attractive because it is conceptually very simple.
However, it suffers from some serious drawbacks in terms of convergence to non-optimal
solutions, sensitivity to additive noise and lack of a model function. We discuss these issues
and, based on this insight, propose a new, modified K-plane regression algorithm. In the
modified algorithm also we keep repeatedly partitioning the data and learning a linear model
for each partition. However, we try to separately and simultaneously learn the centers of the
partitions and the corresponding linear models. Through empirical studies we show that this
algorithm is very effective for learning piecewise linear regression surfaces and it compares
favourably with other state-of-the-art regression function learning methods.

The rest of the paper is organized as follows. In Section 2 we discuss the K-plane regression
algorithm, its drawbacks and possible reasons behind them. We then propose the modified
K-plane regression algorithm in Section 3. We also show that the modified K-plane regression
algorithm monotonically decreases the error function after every iteration. In Section 4 we
show the equivalence of our algorithm with an EM algorithm in a limiting case. Experimental
results are given in Section 5. We conclude the paper in Section 6.

2. K-Plane Regression

We begin by defining a K-piecewise affine function. We use the notation that a hyperplane
in $\mathbb{R}^d$ is parameterized by $\tilde{w} = [w^T \; b]^T \in \mathbb{R}^{d+1}$ where $w \in \mathbb{R}^d$ and $b \in \mathbb{R}$.

Definition 1 A function $f : \mathbb{R}^d \to \mathbb{R}$ is called K-piecewise affine if there exists a set of
K hyperplanes with parameters $(w_1, b_1), \ldots, (w_K, b_K) \in \mathbb{R}^{d+1}$ ($(w_i, b_i) \neq (w_j, b_j)$, $i \neq j$),
and sets $\mathbb{S}_1, \ldots, \mathbb{S}_K \subset \mathbb{R}^d$ (which form a partition of $\mathbb{R}^d$), such that $f(x) = w_k^T x + b_k$, $\forall x \in \mathbb{S}_k$.

From the definition above, it is clear that $(w_j^T x + b_j - f(x))^2 \geq (w_k^T x + b_k - f(x))^2 = 0$,
$\forall x \in \mathbb{S}_k$, $\forall j \neq k$. Also, note that a K-piecewise affine function may be discontinuous.
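
A one-dimensional example may help to see why Definition 1 admits discontinuous functions. The sketch below is an illustrative 2-piecewise affine function chosen for this note (the breakpoint and coefficients are assumptions, not an example from the paper); it jumps from values near 1 to the value 2 at x = 1.

```python
def piecewise_affine(x: float) -> float:
    """A 2-piecewise affine function on the real line with a jump at x = 1.

    Region 1 = (-inf, 1) uses (w_1, b_1) = (1, 0);
    region 2 = [1, inf) uses (w_2, b_2) = (-1, 3).
    """
    if x < 1.0:
        return 1.0 * x + 0.0
    return -1.0 * x + 3.0
```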

K-Plane Regression

Let $S = \{(x_1, y_1), \ldots, (x_N, y_N)\}$ be the training dataset, where $(x_n, y_n) \in \mathbb{R}^d \times \mathbb{R}$. Let
$\tilde{x}_n = [x_n^T \; 1]^T$, $n = 1 \ldots N$. The K-plane regression approach tries to find a pre-fixed number
of hyperplanes such that each point in the training set is close to one of the hyperplanes.
Let K be the number of hyperplanes. Let $\tilde{w}_k$, $k = 1 \ldots K$, be the parameters of the
hyperplanes. K-plane regression minimizes the following error function.

$$E(\Theta) = \sum_{n=1}^{N} \min_{k \in \{1,\ldots,K\}} (\tilde{w}_k^T \tilde{x}_n - y_n)^2$$

where $\Theta = \{\tilde{w}_1, \ldots, \tilde{w}_K\}$. Given the parameters of K hyperplanes, $\tilde{w}_1, \ldots, \tilde{w}_K$, define sets
$S_k$, $k = 1 \ldots K$, as $S_k := \{x_n \mid k = \arg\min_{j \in \{1,\ldots,K\}} (\tilde{w}_j^T \tilde{x}_n - y_n)^2\}$ where we break ties by
putting $x_n$ in the set $S_k$ with least k. The sets $S_k$ are disjoint. We can now write $E(\Theta)$ as

$$E(\Theta) = \sum_{k=1}^{K} \sum_{x_n \in S_k} (\tilde{w}_k^T \tilde{x}_n - y_n)^2 \qquad (1)$$

If we fix all $S_k$, then $\tilde{w}_k$ can be found by minimizing (over $\tilde{w}$) $\sum_{x_n \in S_k} (\tilde{w}^T \tilde{x}_n - y_n)^2$.
However, in $E(\Theta)$ defined in equation (1), the sets $S_k$ themselves are functions of the parameter
set $\Theta = \{\tilde{w}_1, \ldots, \tilde{w}_K\}$.

To find $\Theta$ which minimizes $E(\Theta)$ in (1), we can have an EM-like algorithm as follows.
Let, after the c-th iteration, the parameter set be $\Theta^c$. Keeping $\Theta^c$ fixed, we first find sets
$S_k^c = \{x_n \mid k = \arg\min_{j \in \{1,\ldots,K\}} (\tilde{x}_n^T \tilde{w}_j^c - y_n)^2\}$, $k = 1 \ldots K$. Now we keep these sets
$S_k^c$, $k = 1 \ldots K$, fixed. Thus the error function becomes

$$E^c(\Theta) = \sum_{k=1}^{K} \sum_{x_n \in S_k^c} (\tilde{w}_k^T \tilde{x}_n - y_n)^2 = \sum_{k=1}^{K} E_k^c(\tilde{w}_k)$$

where superscript c denotes the iteration and hence emphasizes the fact that the error
function is evaluated by fixing the sets $S_k^c$, $k = 1 \ldots K$, and

$$E_k^c(\tilde{w}_k) = \sum_{x_n \in S_k^c} (\tilde{w}_k^T \tilde{x}_n - y_n)^2. \qquad (2)$$

Thus, minimizing $E^c(\Theta)$ with respect to $\Theta$ boils down to minimizing each of $E_k^c(\tilde{w}_k)$ with
respect to $\tilde{w}_k$. For every $k \in \{1, \ldots, K\}$, a new weight vector $\tilde{w}_k^{c+1}$ is found using the standard
linear least squares solution as follows.

$$\tilde{w}_k^{c+1} = \arg\min_{\tilde{w}_k} \sum_{x_n \in S_k^c} (\tilde{w}_k^T \tilde{x}_n - y_n)^2 = \Big[ \sum_{x_n \in S_k^c} \tilde{x}_n \tilde{x}_n^T \Big]^{-1} \Big[ \sum_{x_n \in S_k^c} y_n \tilde{x}_n \Big] \qquad (3)$$

Now we fix $\Theta^{c+1}$ and find new sets $S_k^{c+1}$, $k = 1 \ldots K$, and so on. We can now summarize
the K-plane regression algorithm. We first find sets $S_k^c$, $k = 1 \ldots K$, for iteration c (using
$\tilde{w}_k^c$, $k = 1 \ldots K$). Then for each $k = 1 \ldots K$, we find $\tilde{w}_k^{c+1}$ (as in equation (3)) by minimizing
$E_k^c(\tilde{w}_k)$ which is defined in equation (2). We keep on repeating these two steps until there
is no significant decrement in the error function $E(\Theta)$. $E(\Theta)$ does not change when the
weight vectors do not change or the sets $S_k^c$, $k = 1 \ldots K$, do not change. The complete K-plane
regression approach is described more formally in Algorithm 1.

Algorithm 1: K-plane regression

Input: $\{(x_1, y_1), \ldots, (x_N, y_N)\}$
Output: $\{\tilde{w}_1, \ldots, \tilde{w}_K\}$
begin
    Step 1: Initialize $\tilde{w}_k^0$, $k = 1 \ldots K$. Initialize $c = 0$.
    Step 2: Find sets $S_k^c$, $k = 1 \ldots K$
        $S_k^c = \{x_n \mid k = \arg\min_{j \in \{1,\ldots,K\}} (\tilde{x}_n^T \tilde{w}_j^c - y_n)^2\}$
    Step 3: Find $\tilde{w}_k^{c+1}$, $k = 1 \ldots K$, as follows
        $\tilde{w}_k^{c+1} = \big[ \sum_{x_n \in S_k^c} \tilde{x}_n \tilde{x}_n^T \big]^{-1} \big[ \sum_{x_n \in S_k^c} y_n \tilde{x}_n \big]$
    Step 4: Find sets $S_k^{c+1}$, $k = 1 \ldots K$
        $S_k^{c+1} = \{x_n \mid k = \arg\min_{j \in \{1,\ldots,K\}} (\tilde{x}_n^T \tilde{w}_j^{c+1} - y_n)^2\}$
    Step 5: Termination Criteria
        if $S_k^{c+1} = S_k^c$, $k = 1 \ldots K$ then
            stop
        else
            c = c + 1
            go to Step 3
        end
end

Figure 1: (a) Points sampled from a triangle shaped function f*(x), (b) function f(x) learnt
using the K-plane regression algorithm given the points sampled from function f*(x).
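
The alternation in Algorithm 1 can be sketched in a few lines for the one-dimensional case. This is a minimal illustration and not the authors' implementation: the helper names (`fit_line`, `assign`, `k_plane_regression`), the restriction to scalar inputs, the iteration cap, and the rule of keeping the previous line when a set becomes empty are all assumptions made for the sketch.

```python
from typing import List, Tuple

Point = Tuple[float, float]   # (x, y)
Line = Tuple[float, float]    # (slope w, intercept b)


def fit_line(points: List[Point]) -> Line:
    """Ordinary least squares fit of y = w*x + b (Step 3 for scalar inputs)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    if sxx == 0.0:  # all x equal: a constant fit is one least squares minimizer
        return 0.0, my
    w = sum((x - mx) * (y - my) for x, y in points) / sxx
    return w, my - w * mx


def assign(points: List[Point], lines: List[Line]) -> List[int]:
    """Steps 2 and 4: index of the line with least squared residual (ties -> least k)."""
    return [min(range(len(lines)),
                key=lambda k: (lines[k][0] * x + lines[k][1] - y) ** 2)
            for x, y in points]


def k_plane_regression(points: List[Point], lines: List[Line],
                       max_iter: int = 100) -> List[Line]:
    """Alternate between refitting lines and repartitioning until the sets stop changing."""
    labels = assign(points, lines)
    for _ in range(max_iter):
        new_lines = []
        for k, old in enumerate(lines):
            members = [p for p, lab in zip(points, labels) if lab == k]
            # Assumption of this sketch: keep the old line if its set is empty.
            new_lines.append(fit_line(members) if members else old)
        lines = new_lines
        new_labels = assign(points, lines)
        if new_labels == labels:  # Step 5: partitions unchanged
            break
        labels = new_labels
    return lines
```

Started from lines close to the two sides of a triangle shaped function, this sketch recovers them; the outcome depends on the initialization.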



2.1 Issues with K-Plane Regression 

In spite of its simplicity and easy updates, the K-plane regression algorithm has some serious
drawbacks in terms of convergence and model issues.

1. Convergence to Non-optimal Solutions

It is observed that the algorithm has a serious problem of convergence to non-optimal solutions.
Even when the data is generated from a piecewise linear function, the algorithm often fails
to learn the structure of the target function.

Figure 1(a) shows points sampled from a concave (triangle shaped) 2-piecewise affine
function on the real line. At the horizontal axis, circles represent set $S_1$ and squares represent
set $S_2$, where $S_1$ and $S_2$ constitute the correct partitioning of the training set in this problem.
We see that the convex hulls of sets $S_1$ and $S_2$ are disjoint. This 2-piecewise affine function can
be written (as per Definition 1) by choosing $\mathbb{S}_1, \mathbb{S}_2$ to be the convex hulls of $S_1$ and $S_2$.

Figure 1(b) shows the 2-piecewise linear function learnt using the K-plane regression
approach for a particular initialization. $S_1'$ (represented as circles) and $S_2'$ (represented as
squares) are the sets corresponding to the two lines in the learnt function. Here, the K-plane
regression algorithm completely misses the shape of the target function. We also see that
the convex hulls of sets $S_1'$ and $S_2'$ intersect with each other.

2. Sensitivity to Noise 

It has been observed in practice that the simple K-plane regression algorithm is very
sensitive to additive noise in the target values in the training set. With noisy examples, the
algorithm performs badly. We illustrate this later in Section 5.

3. Lack of Model Function 

The output of the K-plane regression algorithm is a set of K hyperplanes. But this algorithm
does not provide a way to use these hyperplanes to predict the value for a given test point. In
other words, the K-plane regression algorithm does not have any model function for prediction.
We expand on this issue in the next section.

3. Modified K-Plane Regression

As we have mentioned, given the training data, $\{(x_1, y_1), \ldots, (x_N, y_N)\}$, the K-plane
regression algorithm outputs K hyperplanes, $\tilde{w}_k^*$, $k = 1 \ldots K$. To convert this into a
proper K-piecewise linear model in $\mathbb{R}^d$, we also need to have a K-partition of $\mathbb{R}^d$ such
that in the k-th partition, the appropriate model to use would be $\tilde{w}_k^*$. We could attempt
to get such a partition of $\mathbb{R}^d$ by considering the convex hulls of $S_k^*$, $k = 1 \ldots K$ (where
$S_k^* = \{x_n \mid k = \arg\min_j (y_n - \tilde{x}_n^T \tilde{w}_j^*)^2\}$). However, as we saw, under K-plane regression,
the convex hulls of such $S_k^*$ need not be disjoint. Hence another method to get the required
partition is as follows. Let $\mu_k^*$ be the mean or centroid of $S_k^*$. Then, for any point $x \in \mathbb{R}^d$,
our prediction could be $y = \tilde{x}^T \tilde{w}_j^*$, where j is such that $\|x - \mu_j^*\| \leq \|x - \mu_k^*\|$, $\forall k \neq j$
(break ties arbitrarily). This would define a proper model function with the hyperplanes
obtained through K-plane regression. However, this may not give good performance. Often,
the convex hulls of sets $S_k^*$, $k = 1 \ldots K$ (learnt using K-plane regression), have non-null
intersection because each of these sets may contain points from different disjoint regions of
$\mathbb{R}^d$ (for example, see Figure 1). In such cases, if we re-partition the training data using
distances to different $\mu_k^*$, we may get sets much different from $S_k^*$ and hence our final prediction
on even training data may have large error. The main reason for this problem with K-plane
regression is that the algorithm is not really bothered about the geometry of the sets $S_k$;
it only focuses on $\tilde{w}_k$ being a good fit for points in set $S_k$. Moreover, in situations where
the same affine function works for two or more disjoint clusters, K-plane regression will consider
them as a single cluster as the objective function of K-plane regression does not enforce that
points in the same cluster should be close to each other. As a result, the clusters learnt
using the K-plane regression approach will have overlapping convex hulls and sometimes even
their means may be very close to each other. This may create problems during prediction.
If we use the hyperplane whose corresponding cluster mean is closest to a point, then we
may not pick up the correct hyperplane. This identifiability problem of the K-plane regression
approach results in poor performance.
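
The nearest-centroid model function described above can be written down directly. The sketch below does so for scalar inputs; the name `predict_nearest_centroid`, the (slope, intercept) representation of each hyperplane, and the choice of breaking ties in favour of the least index are assumptions made for illustration.

```python
from typing import List, Tuple


def predict_nearest_centroid(x: float,
                             lines: List[Tuple[float, float]],
                             centroids: List[float]) -> float:
    """Predict with the line whose cluster centroid is closest to x.

    Ties are broken by taking the least index (an arbitrary choice).
    """
    j = min(range(len(centroids)), key=lambda k: abs(x - centroids[k]))
    w, b = lines[j]
    return w * x + b
```

When the centroids of two clusters nearly coincide, the selected index `j` becomes unreliable, which is the identifiability problem noted above.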

Motivated by this, we modify K-plane regression as follows. We want to simultaneously
estimate $\tilde{w}_k$, $k = 1 \ldots K$, and $\mu_k$, $k = 1 \ldots K$, such that if $\tilde{w}_k$ is a good fit for the k-th partition,
all the points in the k-th partition should be closer to $\mu_k$ than to any other $\mu_j$. Intuitively, we can
think of $\mu_k$ as the center of the (cluster or) set of points $S_k$. However, as we saw from our
earlier example, if we simply make $\mu_k$ the centroid of the final $S_k$ learnt, all the earlier
problems still remain. Hence, in the modified K-plane regression, we try to independently
learn both $\tilde{w}_k$ and $\mu_k$ from the data. To do that, we add an extra term to the objective
function of the K-plane regression approach which tries to ensure that all the points of the same
cluster are close together.

As earlier, let the number of hyperplanes be K. Here, in the modified K-plane regression,
we have to learn 2K parameter vectors. Corresponding to the k-th partition, we have two
parameter vectors, $\tilde{w}_k \in \mathbb{R}^{d+1}$ and $\mu_k \in \mathbb{R}^d$. $\tilde{w}_k$ represents the parameter vector of the
hyperplane associated with the k-th partition and $\mu_k$ represents the center of the k-th partition. Note
that we want to simultaneously learn both $\tilde{w}_k$ and $\mu_k$ for every partition.

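The error function minimized by the modified K-plane regression algorithm is

$$E(\Theta) = \sum_{n=1}^{N} \min_{k \in \{1,\ldots,K\}} \big[ (\tilde{w}_k^T \tilde{x}_n - y_n)^2 + \gamma \|x_n - \mu_k\|^2 \big] \qquad (4)$$

where $\Theta = \{(\tilde{w}_1, \mu_1), \ldots, (\tilde{w}_K, \mu_K)\}$ and $\gamma$ is a user defined parameter which decides the
relative weight of the two terms.

Given $\Theta$, we define sets $S_k$, $k = 1 \ldots K$, as

$$S_k := \big\{ x_n \mid k = \arg\min_{j \in \{1,\ldots,K\}} \big[ (\tilde{w}_j^T \tilde{x}_n - y_n)^2 + \gamma \|x_n - \mu_j\|^2 \big] \big\}$$

where we break ties by putting $x_n$ in the set $S_k$ with least k. The sets $S_k$ are disjoint. We
can now write $E(\Theta)$ as

$$E(\Theta) = \sum_{k=1}^{K} \sum_{x_n \in S_k} \big[ (\tilde{w}_k^T \tilde{x}_n - y_n)^2 + \gamma \|x_n - \mu_k\|^2 \big] \qquad (5)$$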
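
For scalar inputs, the error function in equation (4) can be evaluated with a short helper. The sketch below is illustrative only; the name `modified_error` and the (w, b, mu) tuple layout for each partition are assumptions of this note.

```python
from typing import List, Tuple


def modified_error(points: List[Tuple[float, float]],
                   params: List[Tuple[float, float, float]],
                   gamma: float) -> float:
    """E(Theta) of equation (4) for 1-D inputs.

    points: list of (x, y); params: list of (slope w, intercept b, center mu).
    """
    return sum(min((w * x + b - y) ** 2 + gamma * (x - mu) ** 2
                   for w, b, mu in params)
               for x, y in points)
```

With `gamma = 0` the second term vanishes and the value reduces to the error function of plain K-plane regression.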

As can be seen from the above, now, for a data point $x_n$ to be put in $S_k$, we not
only need $\tilde{w}_k^T \tilde{x}_n$ to be close to $y_n$ as earlier, but also need $x_n$ to be close to $\mu_k$, the
'current center' of $S_k$. The motivation is that, under such a partitioning strategy, each
$S_k$ would contain only points that are close to each other. As we shall see later through
simulations, this modification ensures that the algorithm performs well. As an example of
where this modification is important, consider learning a piecewise linear model which is
given by the same affine function in two (or more) disjoint regions in the feature space. For any
splitting of all examples from these two regions into two parts, there will be a good linear
model that fits each of the two parts. Hence, in the K-plane regression method, the $E(\Theta)$
function (cf. equation (1)) would be the same for any splitting of the examples from these two regions,
which means we would not learn a good model. However, the modified K-plane regression
approach will not treat all such splits as the same because of the term involving $\mu_k$. This helps
us learn a proper piecewise linear regression function. We illustrate this in Section 5.
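
A tiny numerical example makes this point concrete. In the sketch below, four points lie on the line y = x in two separated regions; the helper `partition_cost` (a hypothetical name) evaluates the cost of a fixed partition as in equation (5), under the assumptions that inputs are scalar, every subset is fitted by the shared line y = x, and each center is the subset mean.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def partition_cost(subsets: List[List[Point]], gamma: float) -> float:
    """Cost of a fixed partition as in equation (5) for 1-D inputs.

    Assumes every subset is fitted by y = x (zero residual on this data)
    and that mu_k is the mean of the inputs in subset k.
    """
    total = 0.0
    for subset in subsets:
        mu = sum(x for x, _ in subset) / len(subset)
        total += sum((1.0 * x - y) ** 2 + gamma * (x - mu) ** 2
                     for x, y in subset)
    return total


# Two regions, [0, 1] and [3, 4], sharing the affine function y = x.
natural = [[(0.0, 0.0), (1.0, 1.0)], [(3.0, 3.0), (4.0, 4.0)]]
mixed = [[(0.0, 0.0), (3.0, 3.0)], [(1.0, 1.0), (4.0, 4.0)]]
```

With `gamma = 0` (plain K-plane regression) both partitions have zero cost, while with `gamma = 1` the region-respecting partition is clearly preferred.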

Now consider finding $\Theta$ to minimize $E(\Theta)$ given by equation (5). If we fix all $S_k$, then
$\tilde{w}_k$ and $\mu_k$ can be found by minimizing (over $\tilde{w}$, $\mu$) $\sum_{x_n \in S_k} \big[ (\tilde{w}^T \tilde{x}_n - y_n)^2 + \gamma \|x_n - \mu\|^2 \big]$.
However, in $E(\Theta)$ defined in equation (5), the sets $S_k$ themselves are functions of the
parameter set $\Theta = \{(\tilde{w}_1, \mu_1), \ldots, (\tilde{w}_K, \mu_K)\}$.

To find $\Theta$ which minimizes $E(\Theta)$ in (5), we can, once again, have an EM-like algorithm
as follows. As earlier, let the parameter set after the c-th iteration be $\Theta^c$. Keeping $\Theta^c$ fixed, we
find the sets $S_k^c$, $k = 1 \ldots K$, as follows

$$S_k^c = \big\{ x_n \mid k = \arg\min_{j \in \{1,\ldots,K\}} \big[ (\tilde{x}_n^T \tilde{w}_j^c - y_n)^2 + \gamma \|x_n - \mu_j^c\|^2 \big] \big\} \qquad (6)$$

Now we keep these sets $S_k^c$ fixed. Thus the error function becomes

$$E^c(\Theta) = \sum_{k=1}^{K} \sum_{x_n \in S_k^c} \big[ (\tilde{w}_k^T \tilde{x}_n - y_n)^2 + \gamma \|x_n - \mu_k\|^2 \big] = \sum_{k=1}^{K} E_k^c(\tilde{w}_k, \mu_k)$$

where superscript c denotes the iteration and emphasizes the fact that the error function
is evaluated by fixing the sets $S_k^c$, $k = 1 \ldots K$. Thus minimizing $E^c(\Theta)$ with respect to $\Theta$
boils down to minimizing each of $E_k^c(\tilde{w}_k, \mu_k)$ with respect to $(\tilde{w}_k, \mu_k)$. Each $E_k^c(\tilde{w}_k, \mu_k)$ is
composed of two terms. The first term depends only on $\tilde{w}_k$ and it is the usual sum of
squares of errors. The second term depends only on $\mu_k$ and it is the usual cost function of
K-means clustering. Thus, the update equations for finding $\tilde{w}_k^{c+1}$ and $\mu_k^{c+1}$, $k = 1 \ldots K$,
are

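$$\tilde{w}_k^{c+1} = \arg\min_{\tilde{w}_k} \sum_{j=1}^{K} E_j^c(\tilde{w}_j, \mu_j) = \arg\min_{\tilde{w}_k} \sum_{x_n \in S_k^c} (\tilde{w}_k^T \tilde{x}_n - y_n)^2 = \Big[ \sum_{x_n \in S_k^c} \tilde{x}_n \tilde{x}_n^T \Big]^{-1} \Big[ \sum_{x_n \in S_k^c} y_n \tilde{x}_n \Big] \qquad (7)$$

$$\mu_k^{c+1} = \arg\min_{\mu_k} \sum_{j=1}^{K} E_j^c(\tilde{w}_j, \mu_j) = \arg\min_{\mu_k} \sum_{x_n \in S_k^c} \|x_n - \mu_k\|^2 = \frac{1}{|S_k^c|} \sum_{x_n \in S_k^c} x_n \qquad (8)$$

Once we compute $\Theta^{c+1}$, we find new sets $S_k^{c+1}$, $k = 1 \ldots K$, and so on.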

In summary, the modified K-plane regression algorithm works as follows. We first find
sets $S_k^c$, $k = 1 \ldots K$, for iteration c (using $(\tilde{w}_k^c, \mu_k^c)$, $k = 1 \ldots K$) as given by equation (6). Then
for each $k = 1 \ldots K$, we find $(\tilde{w}_k^{c+1}, \mu_k^{c+1})$ (as in equations (7) and (8)) by minimizing
$E_k^c(\tilde{w}_k, \mu_k)$. We keep on repeating these two steps until there is no significant decrement in
the error function $E(\Theta)$. The complete description of the modified K-plane regression approach
is given in Algorithm 2.

Algorithm 2: Modified K-plane regression

Input: $\{(x_1, y_1), \ldots, (x_N, y_N)\}$
Output: $\{(\tilde{w}_1, \mu_1), \ldots, (\tilde{w}_K, \mu_K)\}$
begin
    Step 1: Initialize $(\tilde{w}_k^0, \mu_k^0)$, $k = 1 \ldots K$. Initialize $c = 0$.
    Step 2: Find $S_k^c$, $k = 1 \ldots K$, as follows
        $S_k^c = \{x_n \mid k = \arg\min_{j \in \{1,\ldots,K\}} [(\tilde{x}_n^T \tilde{w}_j^c - y_n)^2 + \gamma \|x_n - \mu_j^c\|^2]\}$
    Step 3: Find $(\tilde{w}_k^{c+1}, \mu_k^{c+1})$, $k = 1 \ldots K$, as follows
        $\tilde{w}_k^{c+1} = \big[ \sum_{x_n \in S_k^c} \tilde{x}_n \tilde{x}_n^T \big]^{-1} \big[ \sum_{x_n \in S_k^c} y_n \tilde{x}_n \big]$
        $\mu_k^{c+1} = \frac{1}{|S_k^c|} \sum_{x_n \in S_k^c} x_n$
    Step 4: Find $S_k^{c+1}$, $k = 1 \ldots K$, as follows
        $S_k^{c+1} = \{x_n \mid k = \arg\min_{j \in \{1,\ldots,K\}} [(\tilde{x}_n^T \tilde{w}_j^{c+1} - y_n)^2 + \gamma \|x_n - \mu_j^{c+1}\|^2]\}$
    Step 5: Termination Criteria
        if $S_k^{c+1} = S_k^c$, $\forall k$ then
            stop
        else
            c = c + 1
            go to Step 3
        end
end
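
The steps of Algorithm 2 can be sketched for scalar inputs as follows. This is a hedged illustration rather than the authors' code: the names (`fit_line`, `point_cost`, `assign`, `modified_k_plane`), the (w, b, mu) parameter layout, the iteration cap, the handling of empty sets, and the returned error history are assumptions introduced here.

```python
from typing import List, Tuple

Point = Tuple[float, float]          # (x, y)
Param = Tuple[float, float, float]   # (slope w, intercept b, center mu)


def fit_line(points: List[Point]) -> Tuple[float, float]:
    """Least squares fit of y = w*x + b, the scalar form of equation (7)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    if sxx == 0.0:  # all x equal: a constant fit is one least squares minimizer
        return 0.0, my
    w = sum((x - mx) * (y - my) for x, y in points) / sxx
    return w, my - w * mx


def point_cost(x: float, y: float, p: Param, gamma: float) -> float:
    """Squared residual plus gamma times squared distance to the center."""
    w, b, mu = p
    return (w * x + b - y) ** 2 + gamma * (x - mu) ** 2


def assign(points: List[Point], params: List[Param], gamma: float) -> List[int]:
    """Equation (6): index with the least cost for each point (ties -> least k)."""
    return [min(range(len(params)),
                key=lambda k: point_cost(x, y, params[k], gamma))
            for x, y in points]


def total_error(points, params, labels, gamma) -> float:
    return sum(point_cost(x, y, params[lab], gamma)
               for (x, y), lab in zip(points, labels))


def modified_k_plane(points: List[Point], params: List[Param], gamma: float,
                     max_iter: int = 100) -> Tuple[List[Param], List[float]]:
    """Return the learnt parameters and the error after each repartitioning."""
    labels = assign(points, params, gamma)
    history = [total_error(points, params, labels, gamma)]
    for _ in range(max_iter):
        new_params = []
        for k, old in enumerate(params):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if not members:  # assumption of this sketch: keep old parameters
                new_params.append(old)
                continue
            w, b = fit_line(members)
            mu = sum(x for x, _ in members) / len(members)  # equation (8)
            new_params.append((w, b, mu))
        params = new_params
        new_labels = assign(points, params, gamma)
        history.append(total_error(points, params, new_labels, gamma))
        if new_labels == labels:
            break
        labels = new_labels
    return params, history
```

The returned history makes it easy to check on a given dataset that the error never increases from one iteration to the next.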



Monotone Error Decrement Property

Now we will show that the modified K-plane regression algorithm monotonically decreases the
error function defined by equation (5) (see footnote 1).

Theorem 1 Algorithm 2 monotonically decreases the cost function given by equation (5)
after every iteration.

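Proof We have

$$E^c(\Theta^c) = \sum_{k=1}^{K} \sum_{x_n \in S_k^c} \big[ (\tilde{x}_n^T \tilde{w}_k^c - y_n)^2 + \gamma \|x_n - \mu_k^c\|^2 \big]$$

Given the sets $S_k^c$, $k = 1 \ldots K$, the parameters $(\tilde{w}_k^{c+1}, \mu_k^{c+1})$, $k = 1 \ldots K$, are found using
equations (7) and (8), in the following way.

$$\tilde{w}_k^{c+1} = \arg\min_{\tilde{w}_k} \sum_{x_n \in S_k^c} (\tilde{w}_k^T \tilde{x}_n - y_n)^2$$

$$\mu_k^{c+1} = \arg\min_{\mu_k} \sum_{x_n \in S_k^c} \|x_n - \mu_k\|^2$$

Thus, we have

$$\sum_{x_n \in S_k^c} (\tilde{x}_n^T \tilde{w}_k^c - y_n)^2 \geq \sum_{x_n \in S_k^c} (\tilde{x}_n^T \tilde{w}_k^{c+1} - y_n)^2, \quad k = 1 \ldots K$$

$$\sum_{x_n \in S_k^c} \|x_n - \mu_k^c\|^2 \geq \sum_{x_n \in S_k^c} \|x_n - \mu_k^{c+1}\|^2, \quad k = 1 \ldots K$$

This will further give us

$$\sum_{x_n \in S_k^c} \big[ (\tilde{x}_n^T \tilde{w}_k^c - y_n)^2 + \gamma \|x_n - \mu_k^c\|^2 \big] \geq \sum_{x_n \in S_k^c} \big[ (\tilde{x}_n^T \tilde{w}_k^{c+1} - y_n)^2 + \gamma \|x_n - \mu_k^{c+1}\|^2 \big], \quad \forall k$$

$$\Rightarrow \sum_{k=1}^{K} \sum_{x_n \in S_k^c} \big[ (\tilde{x}_n^T \tilde{w}_k^c - y_n)^2 + \gamma \|x_n - \mu_k^c\|^2 \big] \geq \sum_{k=1}^{K} \sum_{x_n \in S_k^c} \big[ (\tilde{x}_n^T \tilde{w}_k^{c+1} - y_n)^2 + \gamma \|x_n - \mu_k^{c+1}\|^2 \big]$$

$$\Rightarrow E^c(\Theta^c) \geq E^c(\Theta^{c+1}) \qquad (9)$$

Given $\Theta^{c+1}$, the sets $S_k^{c+1}$, $k = 1 \ldots K$, are found as follows

$$S_k^{c+1} = \big\{ x_n \mid k = \arg\min_{j \in \{1,\ldots,K\}} \big[ (\tilde{x}_n^T \tilde{w}_j^{c+1} - y_n)^2 + \gamma \|x_n - \mu_j^{c+1}\|^2 \big] \big\}$$

Using $S_k^{c+1}$, $k = 1 \ldots K$, we can find $E^{c+1}(\Theta^{c+1})$, which is

$$E^{c+1}(\Theta^{c+1}) = \sum_{k=1}^{K} \sum_{x_n \in S_k^{c+1}} \big[ (\tilde{x}_n^T \tilde{w}_k^{c+1} - y_n)^2 + \gamma \|x_n - \mu_k^{c+1}\|^2 \big]$$

$$= \sum_{k=1}^{K} \sum_{j=1}^{K} \sum_{x_n \in S_k^{c+1} \cap S_j^c} \big[ (\tilde{x}_n^T \tilde{w}_k^{c+1} - y_n)^2 + \gamma \|x_n - \mu_k^{c+1}\|^2 \big]$$

By the definition of sets $S_k^{c+1}$, $\forall x_n \in S_k^{c+1}$, we have,

$$(\tilde{x}_n^T \tilde{w}_k^{c+1} - y_n)^2 + \gamma \|x_n - \mu_k^{c+1}\|^2 \leq (\tilde{x}_n^T \tilde{w}_j^{c+1} - y_n)^2 + \gamma \|x_n - \mu_j^{c+1}\|^2, \quad \forall j \neq k$$

which is also true for any $x_n \in S_k^{c+1} \cap S_j^c$. Thus

$$E^{c+1}(\Theta^{c+1}) \leq \sum_{k=1}^{K} \sum_{j=1}^{K} \sum_{x_n \in S_k^{c+1} \cap S_j^c} \big[ (\tilde{x}_n^T \tilde{w}_j^{c+1} - y_n)^2 + \gamma \|x_n - \mu_j^{c+1}\|^2 \big]$$

$$= \sum_{j=1}^{K} \sum_{x_n \in S_j^c} \big[ (\tilde{x}_n^T \tilde{w}_j^{c+1} - y_n)^2 + \gamma \|x_n - \mu_j^{c+1}\|^2 \big] = E^c(\Theta^{c+1}) \qquad (10)$$

Combining (9) and (10), we get $E^c(\Theta^c) \geq E^c(\Theta^{c+1}) \geq E^{c+1}(\Theta^{c+1})$, which means that after one
complete iteration the modified K-plane regression algorithm (Algorithm 2) decreases the error function. ■

1. Note that this does not necessarily mean that we find the global minimum of the error function. More
importantly, we cannot claim that minimizing the error as defined would lead to learning of a good
piecewise linear model. We note here that the simple K-plane regression algorithm also results in a monotonic
decrease in the error as defined for that algorithm even though it may not learn good models. However,
the fact that the algorithm continuously decreases the error at each iteration is an important property
for a learning algorithm.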



4. EM View of Modified K-Plane Regression Algorithm

Here, we show that the modified K-plane regression algorithm presented in Section 3 can be
viewed as a limiting case of an EM algorithm. In the general K-plane regression idea,
the difficulty is due to the following credit assignment problem. When we decompose the
problem into K subproblems, we do not know which $x_n$ should be considered in which
subproblem. We can view this as the missing information in the EM formulation.

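Recall that $S = \{(x_1, y_1), \ldots, (x_N, y_N)\}$ is the training data set. In the EM framework,
$S$ can be thought of as incomplete data. The missing data would
be $z_n = [z_{n1} \ldots z_{nK}]^T$, where $z_{nk} \in \{0, 1\}$, $\forall n$, $\forall k$, such that $\sum_{k=1}^{K} z_{nk} = 1$, $\forall n$. Then
$z_{nk}$ are defined as,

$$z_{nk} = \begin{cases} 1, & \text{if } (\tilde{w}_k^T \tilde{x}_n - y_n)^2 + \gamma \|x_n - \mu_k\|^2 \leq (\tilde{w}_j^T \tilde{x}_n - y_n)^2 + \gamma \|x_n - \mu_j\|^2, \ \forall j \neq k \\ 0, & \text{otherwise} \end{cases}$$

This gives us the following probability model

$$p(x_n, y_n \mid z_{nk} = 1, \Theta) = p(x_n, y_n \mid \tilde{w}_k, \mu_k) = p(x_n \mid \mu_k) \, p(y_n \mid x_n, \tilde{w}_k)$$

$$p(x_n, y_n \mid z_n, \Theta) = \sum_{k=1}^{K} z_{nk} \, p(x_n \mid \mu_k) \, p(y_n \mid x_n, \tilde{w}_k) \qquad (11)$$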



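In our formulation, $\mu_k$ represents the center of the set of all x for which the k-th linear model
is appropriate. Hence we take $p(x \mid \mu_k) = \mathcal{N}(\mu_k, \frac{\epsilon}{\gamma} I)$, a multivariate Gaussian in which the
covariance matrix is given by $\frac{\epsilon}{\gamma} I$, where $\frac{\epsilon}{\gamma}$ ($\epsilon, \gamma > 0$) is a variance parameter, and $I$ is the
identity matrix. This covariance matrix is common for all K components. We assume that
the target values given in the training set may be corrupted with zero mean Gaussian noise.
Thus, for the k-th component, the target value is assigned using $\tilde{w}_k$ as $y = \tilde{w}_k^T \tilde{x} + e$, where $e$ is
Gaussian noise with mean 0 and variance $\epsilon$. The variance $\epsilon$ is kept the same for all K components.
Thus, $p(y \mid x, \tilde{w}_k) = \mathcal{N}(\tilde{w}_k^T \tilde{x}, \epsilon)$, a Gaussian with mean $\tilde{w}_k^T \tilde{x}$ and variance $\epsilon$. Thus,

$$p(x_n, y_n \mid \tilde{w}_k, \mu_k) = \mathcal{N}\big(\mu_k, \tfrac{\epsilon}{\gamma} I\big) \, \mathcal{N}(\tilde{w}_k^T \tilde{x}_n, \epsilon)$$

$$= \frac{\gamma^{d/2}}{(2\pi\epsilon)^{d/2}} \exp\Big( -\frac{\gamma}{2\epsilon} \|x_n - \mu_k\|^2 \Big) \, \frac{1}{(2\pi\epsilon)^{1/2}} \exp\Big( -\frac{1}{2\epsilon} (\tilde{w}_k^T \tilde{x}_n - y_n)^2 \Big)$$

$$= \frac{1}{L} \exp\Big( -\frac{1}{2\epsilon} \big[ (\tilde{w}_k^T \tilde{x}_n - y_n)^2 + \gamma \|x_n - \mu_k\|^2 \big] \Big)$$

where $L = (2\pi\epsilon)^{(d+1)/2} / \gamma^{d/2}$.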



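Note that $\epsilon$ and $\gamma$ are assumed to be fixed constants, instead of
parameters to be re-estimated. Thus, the density model for incomplete data becomes

$$p(x_n, y_n \mid \Theta) = \sum_{k=1}^{K} I_{\{k = a_n\}} \frac{1}{L} \exp\Big( -\frac{1}{2\epsilon} \big[ (y_n - \tilde{w}_k^T \tilde{x}_n)^2 + \gamma \|x_n - \mu_k\|^2 \big] \Big)$$

$$= \frac{1}{L} \exp\Big( -\frac{1}{2\epsilon} \min_{k \in \{1,\ldots,K\}} \big[ (y_n - \tilde{w}_k^T \tilde{x}_n)^2 + \gamma \|x_n - \mu_k\|^2 \big] \Big) \qquad (12)$$

where $a_n$ denotes the index k for which $z_{nk} = 1$.

The negative of the log-likelihood under the model given in equation (12) is, up to a positive
scale factor and an additive constant, the same as the error function minimized in the modified
K-plane regression algorithm. Hence, we can now
compute the EM iteration for maximizing the log-likelihood computed from (11).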
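
To spell out the correspondence between the likelihood and the error function, the following short derivation assumes that the N training points are independent under the model in equation (12):

$$-\sum_{n=1}^{N} \ln p(x_n, y_n \mid \Theta) = N \ln L + \frac{1}{2\epsilon} \sum_{n=1}^{N} \min_{k \in \{1,\ldots,K\}} \big[ (y_n - \tilde{w}_k^T \tilde{x}_n)^2 + \gamma \|x_n - \mu_k\|^2 \big] = N \ln L + \frac{1}{2\epsilon} E(\Theta)$$

Since $\epsilon$, $\gamma$ and hence $L$ are held fixed, maximizing this likelihood over $\Theta$ is equivalent to minimizing $E(\Theta)$ of equation (4).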

However, the incomplete data log-likelihood under our probability model (12) becomes
non-differentiable due to the hard minimum function. To get around this, we change the
probability model for incomplete data into a mixture model with mixing coefficients as part
of $\Theta$:

$$p(x_n, y_n \mid \Theta) = \frac{1}{L} \sum_{k=1}^{K} \alpha_k \exp\Big( -\frac{1}{2\epsilon} \big[ (y_n - \tilde{w}_k^T \tilde{x}_n)^2 + \gamma \|x_n - \mu_k\|^2 \big] \Big) \qquad (13)$$

where $\alpha_k = P(z_{nk} = 1)$, $\forall n$; $\alpha_k \geq 0$, $\sum_{k=1}^{K} \alpha_k = 1$, and $\Theta = \{(\alpha_1, \tilde{w}_1, \mu_1), \ldots, (\alpha_K, \tilde{w}_K, \mu_K)\}$.
Note that here

$$p(y_n \mid x_n, \Theta) \propto \sum_{k=1}^{K} \frac{\alpha_k \exp\big( -\frac{\gamma}{2\epsilon} \|x_n - \mu_k\|^2 \big)}{\sum_{j=1}^{K} \alpha_j \exp\big( -\frac{\gamma}{2\epsilon} \|x_n - \mu_j\|^2 \big)} \exp\Big( -\frac{1}{2\epsilon} (y_n - \tilde{w}_k^T \tilde{x}_n)^2 \Big),$$

which is the same as the model described in Xu et al. (1995) for a mixture of experts network. The
incomplete data log-likelihood given by (13) will now be smooth and we can use the EM
algorithm to maximize the likelihood. However, the model given in (13) is somewhat different
from the one in equation (12) which was used in Section 3.

We now derive the iterative scheme under the EM framework using the model specified by
equations (11) and (13) and show that in the limit $\epsilon \to 0$, the iterative scheme becomes the
modified K-plane regression algorithm that we presented in Section 3.

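4.1 EM Algorithm

We now describe the EM algorithm with $S = \{(x_1, y_1), \ldots, (x_N, y_N)\}$ as incomplete data and
$\bar{S} = \{(x_1, y_1, z_1), \ldots, (x_N, y_N, z_N)\}$ as complete data and under the model specified by (11)
and (13). The complete data log-likelihood is

$$l_{complete}(\Theta; \bar{S}) = \ln \prod_{n=1}^{N} \prod_{k=1}^{K} \big[ P(x_n, y_n, z_{nk} = 1 \mid \Theta) \big]^{z_{nk}} = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \ln \big[ P(z_{nk} = 1) \, P(x_n, y_n \mid z_{nk} = 1, \Theta) \big]$$

$$= \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \Big[ \ln \alpha_k - \ln L - \frac{(y_n - \tilde{w}_k^T \tilde{x}_n)^2}{2\epsilon} - \frac{\gamma \|x_n - \mu_k\|^2}{2\epsilon} \Big]$$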



E-Step: In E-Step, we find Q(9,9 C ) which is the expectation of complete data log- 
likelihood. 



Q(9,9 C ) = E {zii ... iZjv} [/ complete (9;5)|9 c ] 

N K , _<ti_ N 2 ii 1 1 9 

= 2^2^[ lnafc ~ lnL 2e 2e J p ( z nfc = l|x n , 2M, 9 C 

n=l k=l 
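M-Step: In the M-Step, we maximize $Q(\Theta, \Theta^c)$ with respect to $\Theta$ to find the new parameter
set $\Theta^{c+1}$. This gives the following update equations.

$$\alpha_k^{c+1} = \frac{1}{N} \sum_{n=1}^{N} P(z_{nk} = 1 \mid x_n, y_n, \Theta^c)$$

$$w_k^{c+1} = \left[\sum_{n=1}^{N} P(z_{nk} = 1 \mid x_n, y_n, \Theta^c)\, x_n x_n^T\right]^{-1} \left[\sum_{n=1}^{N} P(z_{nk} = 1 \mid x_n, y_n, \Theta^c)\, y_n x_n\right]$$

$$\mu_k^{c+1} = \frac{\sum_{n=1}^{N} P(z_{nk} = 1 \mid x_n, y_n, \Theta^c)\, x_n}{\sum_{n=1}^{N} P(z_{nk} = 1 \mid x_n, y_n, \Theta^c)}$$

where $P(z_{nk} = 1 \mid x_n, y_n, \Theta^c)$ is given by

$$P(z_{nk} = 1 \mid x_n, y_n, \Theta^c) = \frac{\alpha_k^c\, p(x_n, y_n \mid z_{nk} = 1, \Theta^c)}{\sum_{j=1}^{K} \alpha_j^c\, p(x_n, y_n \mid z_{nj} = 1, \Theta^c)}$$

$$= \frac{\alpha_k^c \exp\left(-\frac{1}{2\epsilon}\left[(y_n - x_n^T w_k^c)^2 + \gamma \|x_n - \mu_k^c\|^2\right]\right)}{\sum_{j=1}^{K} \alpha_j^c \exp\left(-\frac{1}{2\epsilon}\left[(y_n - x_n^T w_j^c)^2 + \gamma \|x_n - \mu_j^c\|^2\right]\right)}$$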
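The E-step and M-step updates can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' MATLAB implementation: the function name `em_step`, the array layout, and the choice to append a constant 1 to each input for the bias term are all hypothetical. The normalizer $1/L$ cancels in the posterior and is therefore omitted.

```python
import numpy as np

def em_step(X, y, alpha, W, M, gamma, eps):
    """One EM iteration for the mixture model (illustrative sketch).

    X: (N, d) inputs, y: (N,) targets, alpha: (K,) mixing coefficients,
    W: (K, d+1) regression weights (last entry is the bias), M: (K, d) centers.
    """
    N = X.shape[0]
    Xa = np.hstack([X, np.ones((N, 1))])  # augmented inputs [x, 1] (assumption)

    # E-step: posterior P(z_nk = 1 | x_n, y_n, Theta^c), computed in log space
    sq_res = (y[:, None] - Xa @ W.T) ** 2                         # (N, K)
    sq_dist = ((X[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)  # (N, K)
    log_r = np.log(alpha)[None, :] - (sq_res + gamma * sq_dist) / (2.0 * eps)
    log_r -= log_r.max(axis=1, keepdims=True)  # stabilise before exponentiating
    R = np.exp(log_r)
    R /= R.sum(axis=1, keepdims=True)

    # M-step: closed-form updates for alpha_k, w_k and mu_k
    Nk = R.sum(axis=0)
    alpha_new = Nk / N
    W_new = np.empty_like(W)
    M_new = np.empty_like(M)
    for k in range(W.shape[0]):
        r = R[:, k:k + 1]
        A = (Xa * r).T @ Xa   # weighted sum of x x^T
        b = (Xa * r).T @ y    # weighted sum of y x
        W_new[k] = np.linalg.lstsq(A, b, rcond=None)[0]
        M_new[k] = (r * X).sum(axis=0) / max(Nk[k], 1e-300)
    return alpha_new, W_new, M_new, R
```

A least-squares solve is used instead of an explicit matrix inverse so that a nearly empty component does not cause a failure; this is a numerical convenience, not something the text prescribes.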

4.2 Limiting Case ($\epsilon \to 0$)

Now consider $\lim_{\epsilon \to 0} P(z_{nk} = 1 \mid x_n, y_n, \Theta^c)$. Let $a_n^c = \arg\min_{j \in \{1, \ldots, K\}} [(y_n - x_n^T w_j^c)^2 + \gamma \|x_n - \mu_j^c\|^2]$.
When $\epsilon \to 0$, the term in the denominator corresponding to index
$a_n^c$ goes to zero most slowly, and hence $\lim_{\epsilon \to 0} P(z_{nk} = 1 \mid x_n, y_n, \Theta^c) = I_{\{k = a_n^c\}}$, where
$I_{\{k = a_n^c\}} = 1$ if $k = a_n^c$ and zero otherwise. In this limiting case, the EM updates of $w_k$ and
$\mu_k$ are the same as the updates of the modified K-plane regression algorithm.
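The limiting behaviour can be checked numerically on a single sample. The sketch below uses made-up per-component costs $c_k = (y_n - x_n^T w_k)^2 + \gamma \|x_n - \mu_k\|^2$ and mixing coefficients (both purely illustrative), and shows the posterior concentrating on the minimum-cost component as $\epsilon$ shrinks, regardless of the mixing coefficients.

```python
import math

def responsibilities(costs, alphas, eps):
    """Posterior P(z_nk = 1 | x_n, y_n) for one sample.

    costs[k] stands for (y_n - x_n^T w_k)^2 + gamma * ||x_n - mu_k||^2.
    """
    logits = [math.log(a) - c / (2.0 * eps) for a, c in zip(alphas, costs)]
    m = max(logits)                       # subtract the max for stability
    weights = [math.exp(l - m) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]

# Illustrative values only: component 1 has the smallest cost,
# component 0 has the largest mixing coefficient.
costs = [0.30, 0.10, 0.50]
alphas = [0.5, 0.2, 0.3]
for eps in (1.0, 0.1, 0.001):
    print(eps, [round(p, 4) for p in responsibilities(costs, alphas, eps)])
```

For large $\epsilon$ the mixing coefficients dominate the posterior; for small $\epsilon$ the posterior approaches the indicator of the arg-min component, which is the hard assignment used by the modified K-plane algorithm.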

5. Experiments 

In this section we present empirical results to show the effectiveness of the modified K-plane
regression approach. We demonstrate how the learnt functions differ among various
regression approaches using two synthetic problems. We also test the performance of our algorithm
on several real datasets. We compare our approach with the hinging hyperplane algorithm,
which is the best state-of-the-art regression tree algorithm, and with support vector regression
(SVR), which is among the best generic regression approaches today.

Dataset Description 

The two synthetic datasets are generated as follows: 

1. Problem 1: In this, points are uniformly sampled from the interval [0, 5]. Then,
for every point $x$ the target value $y$ is assigned as $y = f(x) + \epsilon$, where

$$f(x) = \begin{cases} x, & \text{if } 0 \le x < 1 \\ 2 - x, & \text{if } 1 \le x < 2 \\ c_3\,(7 - 2x), & \text{if } 2 \le x < 3.5 \\ c_4\,(2x - 7), & \text{if } 3.5 \le x \le 5 \end{cases}$$

(here $c_3$ and $c_4$ stand for the constant factors of the last two pieces, which are illegible in the source text)
and $\epsilon$ is a Gaussian random variable with zero mean and variance 0.01. 500 points
are generated for training and 500 points are generated for testing.

2. Problem 2: Points are uniformly sampled from the interval [0, 3]. Then, for every
point $x$ the target value $y$ is assigned as $y = f(x)$, where

$$f(x) = \begin{cases} x, & \text{if } 0 \le x < 1 \\ 1, & \text{if } 1 \le x < 2 \\ x, & \text{if } 2 \le x \le 3 \end{cases}$$

We also generate $y'$ as $y' = f(x) + \epsilon$, where $\epsilon$ is a Gaussian random variable with
zero mean and variance 0.01. 300 points are generated for training and 300 points are
generated for testing.

Note that both the above functions are discontinuous. 
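As an illustration, the Problem 2 data can be generated as in the following sketch. It assumes the middle piece of $f$ is the constant 1 on [1, 2), as the extracted text suggests; the function names, the seeds, and the use of Python's `random` module are hypothetical and not taken from the paper.

```python
import random

def f2(x):
    """Problem 2 target: x on [0, 1), 1 on [1, 2), x on [2, 3] (middle piece assumed)."""
    if 1.0 <= x < 2.0:
        return 1.0
    return x

def make_problem2(n, noise_var=0.01, seed=0):
    """Return inputs, noise-free targets y, and noisy targets y' = f(x) + eps."""
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 3.0) for _ in range(n)]
    ys_clean = [f2(x) for x in xs]
    sigma = noise_var ** 0.5  # gauss() takes a standard deviation, not a variance
    ys_noisy = [y + rng.gauss(0.0, sigma) for y in ys_clean]
    return xs, ys_clean, ys_noisy

# 300 training points and 300 test points, as in the text (seeds are arbitrary)
train_x, train_y, train_y_noisy = make_problem2(300, seed=0)
test_x, test_y, test_y_noisy = make_problem2(300, seed=1)
```

Note that the stated noise level 0.01 is a variance, so the standard deviation passed to the sampler is 0.1.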

We also present experimental comparisons on 4 'real' datasets downloaded from the UCI
ML repository (A. Asuncion, 2007), which are described in Table 1. In our simulations, we
scale all feature values to the range [-1, 1].



Data set             Dimension   # Points
Boston Housing       13          506
Abalone              8           4177
Auto-mpg             7           392
Computer activity    12          8192

Table 1: Details of real world datasets used from UCI ML repository.



Experimental Setup 

We implemented the K-plane regression and modified K-plane regression algorithms in
MATLAB.[1] We have also implemented the hinging hyperplane method in MATLAB. For support
vector regression, we used the LIBSVM (Chang and Lin, 2001) code. All the simulations are
done on a PC (Core 2 Duo, 2.3 GHz, 2 GB RAM).

Modified K-plane regression has one user-defined parameter, $\gamma$. We search for
the best value of $\gamma$ using 10-fold cross-validation and use that value in our simulations.
Both K-plane regression and modified K-plane regression require K (the number
of hyperplanes) to be fixed a priori. In our experiments, we vary K from 2 to
5. Similarly, in the hinging hyperplane method, the maximum number of hinge functions must be
specified. In our simulations, this number is varied from 1 to 5. Support vector regression
has three user-defined parameters: the penalty parameter $C$, the width parameter $\sigma$ of the Gaussian
kernel and the tolerance parameter $\epsilon$. The best values for these parameters are found using 10-fold
cross-validation, and the results reported are with these parameters.

[1] For K-plane regression, there is no specified model function which can be used to predict the value for
a test point. In our simulations, to assign a value to any test point using K-plane regression, we use the
same methodology as in the modified K-plane regression approach. That is, using the $w_k$ learnt, we obtain
the sets $S_k$ as explained in Section 2, then we find the $k$ such that the centroid of $S_k$ is closest to the test point,
and then use that $w_k$ to predict the target.
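The prediction rule described in the note above (pick the partition whose centroid is nearest to the test point, then apply that partition's hyperplane) can be sketched as follows. The function name and the convention that the last entry of each weight vector is the bias are assumptions made for this illustration.

```python
def predict(x, centroids, weights):
    """Predict the target for a test point x (list of d features).

    centroids: K lists of length d (centroid of each partition S_k).
    weights:   K lists of length d + 1; the last entry is the bias (assumption).
    """
    # index of the partition whose centroid is closest to x
    k = min(
        range(len(centroids)),
        key=lambda j: sum((a - b) ** 2 for a, b in zip(x, centroids[j])),
    )
    w = weights[k]
    # zip stops after the d feature weights, so w[-1] is added once as the bias
    return sum(wi * xi for wi, xi in zip(w, x)) + w[-1]

# Illustrative one-dimensional example with three partitions
centroids = [[0.5], [1.5], [2.5]]
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
print(predict([0.2], centroids, weights))
print(predict([1.4], centroids, weights))
```

Partitions that ended up empty have no centroid and would need to be dropped before calling such a function.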






Manwani and Sastry 



Method              Parameters                  MSE
K-plane             # hyperplanes = 4           0.0557
Modified K-plane    # hyperplanes = 4           0.037
Hinge Hyperplane    # hinges = 6                0.0451
SVR                 C = 64, σ = 16, ε = 2^-5    0.013

Table 2: MSE of different regression approaches on problem 1.



Simulation Results: Synthetic Problems 

Problem 1: Figure 2 shows the functions learnt using different approaches on problem 1 and
Table 2 shows the MSE achieved by the different approaches on a test set. The hinge hyperplane
approach and support vector regression (SVR) give continuous approximations
to the function $f$ (see Figs. 2(c) and 2(d)). While SVR gets the shape of the function
well, the function learnt using SVR is not piecewise linear. Figure 2(e) shows the 4-piecewise
affine function learnt using the K-plane regression method. We see that the K-plane regression
approach completely misses the shape of the function, which results in a very high MSE. In
contrast, as can be seen from Figure 2(f), the modified K-plane regression approach learns the
discontinuous function $f$ exactly (even though the function values given in the training set are
noisy).

Recall that in modified K-plane regression, we essentially partition the data and learn
a hyperplane as well as a 'center' or 'mean' (which was called $\mu_k$ in the algorithm) for each
partition. The target function in this example has four linear pieces. If we got the exact
partitioning of the input space, then the ideal centers would be (0.5, 1.5, 2.75, 4.25). The
means learnt using the modified K-plane regression approach are (0.495, 1.495, 2.745, 4.25). This
example demonstrates that our modified K-plane regression algorithm is robust to additive
noise, and that it can learn discontinuous functions well. This example also shows that
the simple-minded K-plane regression performs poorly when there is noise in the training
set.

Problem 2: In this problem the target function is a 3-piecewise affine function, and
we show the functions learnt by different approaches on noise-free as well as noisy training
sets. Figure 3 shows the functions learnt using different approaches given the noise-free training
examples. As can be seen, all algorithms learn a very good approximation of the target
function (when given noise-free training data). The hinge hyperplane method and SVR
learn a continuous approximation, while the K-plane and modified K-plane methods learn
the discontinuous function.

We see that both the K-plane and modified K-plane regression approaches learn the function
exactly. But the MSE of the K-plane method is much higher than that of the modified K-plane method, as
can be seen in Table 3. The reason is as follows. In problem 2, the three sets (defining
the partitioning of the domain of the function) corresponding to the three affine functions
are [0, 1), [1, 2) and [2, 3]. Moreover, $\forall x \in [0, 1) \cup [2, 3]$, $f(x) = x$. Thus the same affine
function is assigned to two of the three disjoint sets. The K-plane regression approach tries
to partition the training data so that for each partition we can learn a good function to fit
the data; it does not care about whether the points in a partition are close together. Hence,
Figure 2: (a) 4-piecewise affine function f described in problem 1; (b) function f corrupted
by additive Gaussian noise; functions learnt using (c) hinge hyperplane algorithm,
(d) support vector regression, (e) K-plane regression approach and (f) modified
K-plane regression approach, given noisy samples of f.
                Method              Parameters                    MSE
Without Noise   K-plane             # hyperplanes = 3             0.7917
                Modified K-plane    # hyperplanes = 3             3.33 × 10^[exponent illegible]
                Hinge Hyperplane    # hinges = 13                 0.008
                SVR                 C = 1024, σ = 16, ε = 2^-8    0.0041
With Noise      K-plane             # hyperplanes = 3             0.1352
                Modified K-plane    # hyperplanes = 3             0.011
                Hinge Hyperplane    # hinges = 23                 0.0237
                SVR                 C = 1024, σ = 16, ε = 2^-8    0.0148

Table 3: MSE of different regression approaches on problem 2.



any partition of the set $[0, 1) \cup [2, 3]$ into two parts (including the case where one part is
empty) would result in roughly the same value of the error function for the K-plane method.
But, for prediction on a new point, we have to use the nearness of the new point to the centroids
of the partitions. Hence, if the partitions are bad, then the final MSE can be very large.
In this problem, when K-plane regression is given noise-free samples, it always learnt only
two hyperplanes irrespective of the value of K used (with the sets $S_k$ corresponding to
the remaining partitions being empty). The means of the two partitions learnt were 1.505
and 1.4975. This clearly shows that the algorithm has put $[0, 1) \cup [2, 3]$ into one partition. This
leads to very poor prediction on test samples and a high MSE in the case of K-plane regression.
In contrast, the means learnt using modified K-plane regression are 0.495, 1.5 and 2.505.

In the second part of this problem, we have added noise to the true function values in
the training set as explained earlier. Figure 4 shows the functions learnt using different
approaches given these noisy samples of the function. The MSE achieved by the learnt
functions on a test set under the different algorithms is shown in Table 3. We see that only
the modified K-plane regression approach learns the target function exactly.

The function learnt by K-plane regression is very poor and its MSE is also high. This
shows that, unlike in the earlier case, the K-plane regression algorithm could not even get
the two affine functions correctly. Comparing this with the shape of the function learnt on this problem by
K-plane regression when the examples are noise-free, we can see that this algorithm is very
sensitive to additive noise.

Both the hinge hyperplane method and SVR learn a good continuous approximation to the
target function. However, these are not as good as the functions learnt by these algorithms
on the noise-free data of this problem. In contrast, the modified K-plane regression algorithm
learns the discontinuous function exactly under our additive noise as well. It also achieves the
minimum MSE, which is nothing but the noise variance, as can be seen in Table 3.

Results on Real Datasets: 

We now discuss the performance of the modified K-plane regression algorithm in comparison with
other approaches on different real datasets. The results provided are based on 10 repetitions
of 10-fold cross-validation. We show average values and standard deviations of the mean square
error (MSE) and of the time taken. The results are presented in Tables 4-7.
Figure 3: (a) 3-piecewise affine function f described in problem 2; functions learnt using (b)
hinge hyperplane algorithm, (c) support vector regression, (d) K-plane regression
approach and (e) modified K-plane regression approach, given noise-free samples
of f.
Figure 4: (a) 3-piecewise affine function f described in problem 2; (b) function f corrupted
by additive Gaussian noise; functions learnt using (c) hinge hyperplane algorithm,
(d) support vector regression, (e) K-plane regression approach and (f) modified
K-plane regression approach, given noisy samples of f.
Method              Parameters                     MSE            Time (sec)
K-plane             # hyperplanes = 2              17.15±0.85     0.01
                    # hyperplanes = 3              27.47±2.74     0.03±0.003
                    # hyperplanes = 4              30.29±1.77     0.04±0.005
                    # hyperplanes = 5              39.67±9.11     0.06±0.012
Modified K-plane    # hyperplanes = 2              14.95±0.27     0.006
(γ = 100)           # hyperplanes = 3              14.72±0.53     0.01±0.001
                    # hyperplanes = 4              14.25±0.62     0.014±0.001
                    # hyperplanes = 5              13.92±0.78     0.02±0.002
Hinge Hyperplane    # hinges = 1                   19.29±2.19     0.01±0.003
                    # hinges = 2                   16.45±1.34     0.04±0.006
                    # hinges = 3                   16.25±1.16     0.07±0.006
                    # hinges = 4                   16.06±0.98     0.11±0.008
                    # hinges = 5                   15.62±1.57     0.14±0.015
SVR                 C = 128, σ = 0.25, ε = 2^-8    10.08±0.42     0.17±0.01

Table 4: Comparison results of modified K-plane regression approach with other regression
approaches on Housing Dataset.



Method              Parameters                   MSE            Time (sec)
K-plane             # hyperplanes = 2            10.44±0.17     0.11±0.02
                    # hyperplanes = 3            9.19±0.62      0.14±0.01
                    # hyperplanes = 4            9.87±1.45      0.27±0.03
                    # hyperplanes = 5            10.85±1.19     0.39±0.05
Modified K-plane    # hyperplanes = 2            4.80±0.02      0.02±0.002
(γ = 100)           # hyperplanes = 3            4.68±0.03      0.04±0.005
                    # hyperplanes = 4            4.69±0.03      0.08±0.01
                    # hyperplanes = 5            4.68±0.03      0.08±0.01
Hinge Hyperplane    # hinges = 1                 4.73±0.06      0.01±0.001
                    # hinges = 2                 4.53±0.03      0.08±0.01
                    # hinges = 3                 4.47±0.04      0.17±0.02
                    # hinges = 4                 4.41±0.02      0.28±0.02
                    # hinges = 5                 4.44±0.06      0.40±0.03
SVR                 C = 32, σ = 0.5, ε = 0.5     4.50±0.01      1.68±0.01

Table 5: Comparison results of modified K-plane regression approach with other regression
approaches on Abalone Dataset.
Method              Parameters                   MSE            Time (sec)
K-plane             # hyperplanes = 2            10.34±0.22     0.02±0.001
                    # hyperplanes = 3            11.15±0.56     0.02±0.003
                    # hyperplanes = 4            13.08±1.10     0.04±0.003
                    # hyperplanes = 5            13.72±0.77     0.05±0.003
Modified K-plane    # hyperplanes = 2            8.55±0.11      0.006
(γ = 100)           # hyperplanes = 3            8.72±0.25      0.01±0.003
                    # hyperplanes = 4            8.82±0.75      0.01±0.001
                    # hyperplanes = 5            8.83±0.69      0.01±0.002
Hinge Hyperplane    # hinges = 1                 9.81±0.52      0.003
                    # hinges = 2                 9.03±0.53      0.02±0.002
                    # hinges = 3                 8.75±0.37      0.03±0.01
                    # hinges = 4                 8.58±0.35      0.05±0.009
                    # hinges = 5                 8.35±0.39      0.08±0.005
SVR                 C = 16, σ = 1, ε = 0.25      6.80±0.26      0.03

Table 6: Comparison results of modified K-plane regression approach with other regression
approaches on Auto-mpg Dataset.



Method              Parameters                   MSE              Time (sec)
K-plane             # hyperplanes = 2            61.81±7.17       0.39±0.05
                    # hyperplanes = 3            19.48±0.49       0.48±0.07
                    # hyperplanes = 4            15.88±0.92       0.92±0.13
                    # hyperplanes = 5            19.98±1.03       1.19±0.20
Modified K-plane    # hyperplanes = 2            154.45±12.61     0.15±0.02
(γ = 100)           # hyperplanes = 3            23.47±1.56       0.24±0.03
                    # hyperplanes = 4            11.88±0.15       0.48±0.04
                    # hyperplanes = 5            11.80±0.24       0.66±0.10
                    # hyperplanes = 6            10.98±0.14       0.70±0.16
Hinge Hyperplane    # hinges = 1                 29.40±21.57      0.05±0.002
                    # hinges = 2                 11.39±0.32       0.17±0.017
                    # hinges = 3                 10.77±0.27       0.38±0.047
                    # hinges = 4                 10.66±0.33       0.64±0.062
                    # hinges = 5                 10.06±0.13       0.98±0.055
SVR                 C = 256, σ = 1, ε = 0.5      8.47±0.11        23.16±0.31

Table 7: Comparison results of modified K-plane regression approach with other regression
approaches on Computer Activity Dataset.
We see that for all datasets, the minimum MSE achieved by the simple K-plane regression method
is the highest among all algorithms. Thus, though the K-plane regression method is
conceptually simple and appealing, its performance is not very good.

The modified K-plane regression algorithm performs much better than K-plane
regression, not only in terms of MSE but also in terms of time taken. The reason why the modified
K-plane method takes less time is that it converges in fewer iterations. This happens
because the modified K-plane regression algorithm also gives importance to the connectedness of
the clusters. As a result, fewer points move from one cluster to another
after each iteration, and thus the clusters stabilize after fewer iterations.

The performance of the modified K-plane regression algorithm is comparable to that of
the hinge hyperplane algorithm in terms of MSE. It performs better than the hinge hyperplane
method on the Housing dataset. On the Auto-mpg, Abalone and Computer
Activity datasets, the minimum MSE of the modified K-plane regression approach is only a little
higher than the minimum MSE of the hinge hyperplane method. The modified K-plane regression
algorithm is also faster than the hinge hyperplane method on all datasets.

On all problems except the Abalone dataset, SVR achieves clearly better MSE
than the modified K-plane regression algorithm; on Abalone the two are close (4.50 versus 4.68, Table 5).
However, we observe that modified K-plane
regression is significantly faster than SVR. In SVR, the complexity of the dual optimization
problem is $O(N^3)$, where $N$ is the number of points. In contrast, in modified K-plane
regression, the major computation at each iteration is finding $K$ linear regression functions.
The time complexity of each iteration in modified K-plane regression is $O(K(d+1)^3)$, which
is much smaller than $O(N^3)$ if $N \gg d$.
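To give a feel for the gap between the two operation counts, the short calculation below plugs in the Computer Activity dataset from Table 1 ($N = 8192$, $d = 12$) with $K = 5$. It compares only the leading terms quoted in the text and ignores constant factors and the number of iterations, so it is an order-of-magnitude illustration rather than a timing prediction; the function names are hypothetical.

```python
def kplane_iteration_cost(K, d):
    """Leading term K * (d + 1)^3 for one modified K-plane iteration."""
    return K * (d + 1) ** 3

def svr_dual_cost(N):
    """Leading term N^3 quoted for the SVR dual problem."""
    return N ** 3

N, d, K = 8192, 12, 5                  # Computer Activity dataset, K = 5
per_iter = kplane_iteration_cost(K, d)  # 5 * 13^3 = 10985
svr = svr_dual_cost(N)                  # 8192^3 = 2^39 = 549755813888
print(per_iter, svr, svr / per_iter)
```

The ratio is on the order of $10^7$, which is consistent in direction with the much larger SVR running time reported in Table 7, although measured times also depend on implementation details.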

Thus, we see that overall, modified K-plane regression is a very attractive method for
learning nonlinear regression functions by approximating them as piecewise linear functions.
It is conceptually simple and the algorithm is very efficient. Its performance is comparable
to that of SVR or the hinge hyperplane method in terms of accuracy. It is significantly faster
than SVR and is also faster than the hinge hyperplane method. Further, unlike all other current
regression function learning algorithms, this method is also capable of learning discontinuous
functions.

6. Conclusions 

In this paper, we considered the problem of learning piecewise linear regression models.
We proposed an interesting and simple algorithm to learn such functions. The proposed
method is also capable of learning discontinuous functions. Through simulation experiments
we showed that the performance of the proposed method is good and is comparable to the
state of the art in regression function learning.

The basic idea behind the proposed method is very simple. Let $S = \{(x_1, y_1), \ldots, (x_N, y_N)\}$
be the training dataset. We essentially want to find a way to partition the set $\{x_1, \ldots, x_N\}$
into $K$ sets such that we can find a good linear fit for the targets (i.e., $y_i$) of points in each
partition. The algorithm achieves this by repeatedly partitioning the points and fitting
linear models. After each model fit, we repartition the points based on the closeness of targets
to the current models. We called this the K-plane regression algorithm. This algorithm is
conceptually simple and is similar in spirit to the K-means clustering method. While such
an idea has been discussed in different contexts, we have not come across this algorithm
proposed and empirically investigated for nonlinear regression.

Though this idea is interesting, as we showed here, it has several drawbacks. As the
results in the previous section show, this algorithm performs poorly even on one-dimensional
problems.

In this paper we have also proposed a modification of the above method which performs
well as a regression learning method. In our modified K-plane regression algorithm, during
the process of repeatedly partitioning feature vectors and fitting linear models, we make the
partitions so that we get good linear models and, also, the points of a partition are all close
together. This idea is easily incorporated into the algorithm by expanding the parameter
vector to be learnt and by modifying the objective function to be minimized. The resulting
algorithm essentially does one step of linear regression and one step of K-means clustering
in each iteration.
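The alternation summarized above (fit a linear model and a center per partition, then reassign each point to the partition with the smallest combined cost) can be sketched as follows. This is an illustrative NumPy version under stated assumptions, not the authors' MATLAB code: the random initial labelling, the handling of empty partitions, the stopping rule and all names are hypothetical.

```python
import numpy as np

def modified_kplane(X, y, K, gamma, n_iter=50, seed=0):
    """Hard-assignment sketch: alternate per-partition fits and reassignment.

    X: (N, d) inputs, y: (N,) targets. Returns weights W (K, d+1) with the
    bias last, centers M (K, d), labels (N,), and the objective history.
    """
    rng = np.random.default_rng(seed)
    N, d = X.shape
    Xa = np.hstack([X, np.ones((N, 1))])  # augmented inputs [x, 1]
    labels = rng.integers(K, size=N)      # random initial partition (assumption)
    W = np.zeros((K, d + 1))
    M = np.zeros((K, d))
    history = []
    for _ in range(n_iter):
        # Fit step: least squares and mean on each non-empty partition
        for k in range(K):
            idx = labels == k
            if not idx.any():
                continue  # empty partition keeps its previous parameters
            W[k] = np.linalg.lstsq(Xa[idx], y[idx], rcond=None)[0]
            M[k] = X[idx].mean(axis=0)
        # Assignment step: squared residual plus gamma times squared distance
        cost = (y[:, None] - Xa @ W.T) ** 2
        cost += gamma * ((X[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)
        new_labels = cost.argmin(axis=1)
        history.append(cost.min(axis=1).sum())
        if np.array_equal(new_labels, labels):
            break  # partitions no longer change
        labels = new_labels
    return W, M, labels, history
```

Each fit step and each assignment step can only lower the recorded objective, so `history` is non-increasing; like K-means, the procedure reaches a local minimum that depends on the initial partition.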

Through empirical studies, we showed that the modified K-plane regression algorithm is
very effective. Its performance on some real datasets is comparable to that of nonlinear SVR
in terms of accuracy, while the proposed method is much faster than SVR. The proposed
method is also faster than the hinge hyperplane algorithm, which is arguably the best method
today for learning piecewise linear functions, with comparable accuracy. Through two synthetic one-dimensional
problems, we also showed that the proposed method has better robustness to additive noise than
the other methods and that it is capable of learning discontinuous functions.

We feel that the proposed method opens up interesting possibilities for designing
algorithms for learning piecewise linear functions. As mentioned earlier, simultaneous estimation
of optimal partitions and optimal models for each partition is computationally intractable.
Hence an interesting and difficult open question is to establish theoretical bounds on the
performance of the modified K-plane regression method. While we showed that the method
can be viewed as a limiting case of an EM algorithm under a reasonable probability model, a lot of
work needs to be done to understand how close to the optimum such methods can converge.